3  How To Visualise Numerical Data

We have learnt how to visualise data using tables, which are useful for both analysts and readers when examining a dataset.

However, tables are not always the most effective way to quickly and intuitively understand patterns or trends in the data. This is where figures become valuable.

In this chapter, we will focus on different types of figures that can be used to represent quantitative variables. For plot interpretation, refer to the Inferential Statistics With R short course.

4 Rice

Photo by Zhao Yangjun on Unsplash

In this chapter, we are going to explore the rice dataset.

Try loading the rice_data file yourself and name the data frame object rice. Then, answer the questions below.

rice <- read.csv("Data-sets/rice_data.csv")
str(rice)
  1. How many variables are in the rice dataset?
  2. How many observations are in the rice dataset?
  3. How many categorical variables are there? (Hint: they may not be stored correctly…yet)
  4. What type of variable has R stored them as?
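
If the output of str() is hard to scan, a few base R helpers report the same information piece by piece. The sketch below runs on R's built-in iris data frame rather than rice, so it does not give away the answers; swapping in rice is left to you.

```r
# Demonstrated on the built-in iris data; replace `iris` with `rice` for the exercise.
df <- iris

n_vars <- ncol(df)   # number of variables (columns)
n_obs  <- nrow(df)   # number of observations (rows)

# Storage class of every column, e.g. "numeric", "character", "factor"
col_classes <- sapply(df, class)

n_vars
n_obs
table(col_classes)   # how many columns of each class
```

Columns stored as "character" or "integer" that really hold group labels are the ones to convert to factors later.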

As can be seen, there is some cleaning up to do! You may also notice that some values are recorded as “*”, which represents a missing value. We can convert these to NA and then drop the rows that contain them.

Fill in the blanks below.

library(tidyverse)

rice <- rice %>%
  mutate(across(where(is.character), ~ na_if(.x, "*"))) %>%
  drop_na() %>%
  mutate(across(
    c(
      Ref, Experiment.site, Year, Exp.name, Sowing.method,
      RANGE, ROW, REP, Variety, Nitrogen, N.group, Lodging..10.flat.
    ),
    as.factor
  )) %>%
  mutate(across(where(is.character), as.numeric))
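
To see what each step of this pipeline does, it can help to run the same idea on a tiny made-up data frame. The sketch below is illustrative only: the toy data frame and its values are invented, and it loads dplyr and tidyr directly (both are part of the tidyverse).

```r
library(dplyr)
library(tidyr)

# Made-up data: Height is read as text because of the "*" entry
toy <- data.frame(
  Variety = c("A", "B", "A", "B"),
  Height  = c("98.5", "*", "102.1", "95.0"),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  mutate(across(where(is.character), ~ na_if(.x, "*"))) %>%  # "*" becomes NA
  drop_na() %>%                                               # drop rows containing NA
  mutate(across(c(Variety), as.factor)) %>%                   # categorical -> factor
  mutate(across(where(is.character), as.numeric))             # remaining text -> numeric

str(toy_clean)
```

The order matters: the factor conversion runs before the final step, so only the columns that are still character (the numeric measurements) are passed to as.numeric.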

Let’s check the structure again to make sure everything has been cleaned up correctly.

str(rice)

Yay! Now that the dataset is cleaned, we can move on to plotting.

One Quantitative Variable

When you have a single numerical variable, whether discrete or continuous, a histogram is a great way to visualise its distribution.

In R, you can create plots using base R. For example, to see a visualisation of plant height, you can simply type:

hist(rice$Plant.height..cm.)

to generate a histogram.

While base R plots are fine for basic analysis, ggplot2 is one of the most commonly used plotting packages and is widely accepted in published reports.

The histogram looks a bit plain without any color, and the x-axis label isn’t very informative. Let’s add some color, fix the labels, and give it a clean white background.

Load in the package and then run the code below.

library(ggplot2)

ggplot(data=rice, aes(x=Plant.height..cm.))+
  geom_histogram(fill="#F28123", col="#D34E24", bins=30)+
  labs(x="Rice Plant Height (cm)")+
  theme_bw()

We can see that this is much better!

It is very important that when you create a plot, you provide meaningful labels with units. You can see below how the number of bins influences the shape of the histogram.

ggplot(data=rice, aes(x=Plant.height..cm.))+
  geom_histogram(fill="#F28123", col="#D34E24", bins=10)+
  labs(x="Rice Plant Height (cm)", y="Frequency")+
  theme_bw()

ggplot(data=rice, aes(x=Plant.height..cm.))+
  geom_histogram(fill="#F28123", col="#D34E24", bins=100)+
  labs(x="Rice Plant Height (cm)", y="Frequency")+
  theme_bw()
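
To put numbers on the effect of bins, the bars that ggplot2 computes can be inspected with layer_data(). The sketch below uses simulated heights (an assumption, so that it runs without the rice file); the data frame sim and the column height_cm are hypothetical names.

```r
library(ggplot2)

set.seed(42)
# Simulated stand-in for plant height; not the rice data
sim <- data.frame(height_cm = rnorm(500, mean = 100, sd = 12))

p_few  <- ggplot(sim, aes(x = height_cm)) + geom_histogram(bins = 10)
p_many <- ggplot(sim, aes(x = height_cm)) + geom_histogram(bins = 100)

# layer_data() returns one row per bar, including its count and limits
bars_few  <- layer_data(p_few)
bars_many <- layer_data(p_many)

nrow(bars_few)          # number of bars computed with bins = 10
nrow(bars_many)         # number of bars computed with bins = 100
max(bars_few$count)     # tallest bar when each bin is wide
max(bars_many$count)    # tallest bar when each bin is narrow
```

Few bins give a smooth but coarse picture, while many bins show more detail at the cost of a noisier shape, since each bar holds only a handful of observations.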

Two Quantitative Variables

To visualise two quantitative variables, a scatterplot is appropriate.

Fill in the blanks below to create a scatterplot of plant height versus HI.

ggplot(data=rice, aes(x=HI, y=Plant.height..cm.))+
  geom_point(fill="#EADF0B", col="#563F1B", pch=21, size=3)+
  labs(x="HI",y="Rice Plant Height (cm)")+
  theme_bw()

5 Palmer Penguins

# Load the libraries and drop rows with missing values

library(ggplot2)
library(palmerpenguins)
clean_penguins <- na.omit(penguins)

# We can plot the histogram
ggplot(clean_penguins, aes(x=body_mass_g))+
  geom_histogram()

# We can customise the histogram with colour, a clearer label, and a white background
ggplot(clean_penguins, aes(x=body_mass_g))+
  geom_histogram(fill="orangered", bins=30) +
  labs(x="Body Mass of Penguins (g)")+
  theme_bw()

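# Stacked histogram, with bars filled by species
ggplot(penguins, aes(body_mass_g, fill = species)) +
  geom_histogram(binwidth = 100)

# Frequency polygons: one line per species instead of stacked bars
ggplot(penguins, aes(body_mass_g, colour = species)) +
  geom_freqpoly(binwidth = 100)

# Same lines on a density scale, so species with different sample sizes are comparable
ggplot(penguins, aes(body_mass_g, after_stat(density), colour = species)) +
  geom_freqpoly(binwidth = 100)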

Scatterplots

What if we’re interested in how body mass relates to flipper length? In this case, we’re working with two continuous numerical variables, so a scatterplot would be a good choice.

ggplot(clean_penguins, aes(x=body_mass_g, y=flipper_length_mm))+
  geom_point()

Like we did with the histogram, we can add color, change the shape of the points, and include informative labels.

ggplot(clean_penguins, aes(x=body_mass_g, y=flipper_length_mm))+
  geom_point(fill="aquamarine", col="aquamarine4", pch=21, size=2.5)+
  labs(x="Body Mass of Penguins (g)", y="Flipper Length of Penguins (mm)")+
  theme_bw()

We can see that as the body mass of a penguin increases, its flipper length also tends to increase. This suggests a strong positive linear relationship.
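
One way to back up what the plot shows is the Pearson correlation coefficient, which measures the strength of a linear association on a scale from -1 to 1. This is a minimal sketch using base R's cor(); it is a quick check rather than a formal analysis.

```r
library(palmerpenguins)

clean_penguins <- na.omit(penguins)

# Pearson correlation: values near +1 indicate a strong increasing linear trend,
# values near 0 indicate little or no linear association
r <- cor(clean_penguins$body_mass_g, clean_penguins$flipper_length_mm)
r
```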

We can customise the scatterplot even further by using the mapping argument to colour the points by species type. To visualise the linear relationship, we include method = "lm" in geom_smooth(), which fits a linear model (a straight line) to the data.

ggplot(
  penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")
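
Because colour is mapped only inside geom_point() in the plot above, geom_smooth() treats all penguins as one group and draws a single line. As a sketch of the alternative, moving the colour mapping into the ggplot() call lets both layers inherit it, so each species gets its own fitted line. The se = FALSE and formula = y ~ x arguments are choices made here to keep the plot and console output tidy.

```r
library(ggplot2)
library(palmerpenguins)

clean_penguins <- na.omit(penguins)

# colour is mapped globally, so geom_point() and geom_smooth() both inherit it
p_by_species <- ggplot(
  clean_penguins,
  aes(x = flipper_length_mm, y = body_mass_g, colour = species)
) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)  # se = FALSE hides the confidence band

p_by_species

# Count the separate lines fitted by the smooth layer (layer 2)
n_lines <- length(unique(layer_data(p_by_species, 2)$group))
n_lines
```

Which version to use depends on the question: a single line describes the overall trend, while per-species lines show whether the trend differs between groups.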

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  scale_color_manual(
    values = c(
      Adelie = "#03045e",
      Chinstrap = "#0096c7",
      Gentoo = "#90e0ef"
    )
  )

# Colour by species and point shape by island, using a ColorBrewer palette
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = island)) +
  scale_colour_brewer(palette = "Set1")

# One panel per island, using the viridis discrete palette
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~island) +
  scale_colour_viridis_d()

Colour-Blind-Friendly Palettes

library(ggthemes)
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~island) +
  scale_colour_colorblind()

# Preview the eight colours in the ggthemes colour-blind-friendly palette
library(scales)

show_col(colorblind_pal()(8))